Refactor CI: pull chroma from Docker image, track SHA via Heroku conf…#6
Refactor CI: pull chroma from Docker image, track SHA via Heroku conf…#6Gabrielebattimelli merged 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Refactors the weekly indexing GitHub Actions workflow to make indexing incremental by restoring chroma/ from the currently deployed Heroku Docker image and tracking the last indexed PhysLib commit SHA via a Heroku config var (instead of committing state back to Heroku Git).
Changes:
- Install Heroku CLI in the workflow and pull
chroma/fromregistry.heroku.com/...to seed incremental runs. - Replace
.last-physlib-shafile tracking withLAST_PHYSLIB_SHAstored as a Heroku config var. - Remove the “push updated index to Heroku git remote” path and rely on rebuilding/releasing the container image.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Pull chroma/ from the live Docker image so the pipeline runs incrementally | ||
| - name: Extract ChromaDB from current Heroku image | ||
| env: | ||
| HEROKU_API_KEY: ${{ secrets.HEROKU_API_KEY }} | ||
| run: | | ||
| git remote add heroku "https://heroku:$HEROKU_API_KEY@git.heroku.com/physlibsearch.git" | ||
|
|
||
| # Restore chroma/ and the SHA tracker from the heroku git remote | ||
| - name: Restore ChromaDB and SHA tracker | ||
| run: | | ||
| git fetch heroku main | ||
| git checkout heroku/main -- chroma/ .last-physlib-sha 2>/dev/null \ | ||
| || echo "No prior index on heroku/main — starting fresh." | ||
| heroku container:login | ||
| docker pull registry.heroku.com/physlibsearch/web || echo "No existing image — starting fresh." | ||
| CID=$(docker create registry.heroku.com/physlibsearch/web 2>/dev/null) || true | ||
| if [ -n "$CID" ]; then | ||
| docker cp "$CID:/app/chroma" . 2>/dev/null || echo "No chroma/ in image — starting fresh." | ||
| docker rm "$CID" | ||
| fi |
There was a problem hiding this comment.
The ChromaDB extraction step runs unconditionally, even when PhysLib hasn't changed and the rest of the job is skipped. This adds a potentially large docker pull to every scheduled run; consider moving the PhysLib change check earlier (or gating the extraction step on has_changes) so the workflow can exit quickly when there's nothing to update.
| run: | | ||
| CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1) |
There was a problem hiding this comment.
CURRENT_SHA=$(git ls-remote ... | cut -f1) can silently produce an empty SHA if git ls-remote fails, because without pipefail the pipeline exit status is from cut. That can make the workflow treat a transient network failure as a change and later persist an empty/incorrect LAST_PHYSLIB_SHA. Add set -o pipefail (or avoid the pipeline) and explicitly fail if CURRENT_SHA is empty.
| run: | | |
| CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1) | |
| run: | | |
| set -o pipefail | |
| CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1) | |
| if [ -z "$CURRENT_SHA" ]; then | |
| echo "Failed to resolve current PhysLib HEAD SHA from $PHYSLIB_REPO" >&2 | |
| exit 1 | |
| fi |
| run: | | ||
| CURRENT_SHA=$(git ls-remote "$PHYSLIB_REPO" HEAD | cut -f1) | ||
| LAST_SHA=$(cat .last-physlib-sha 2>/dev/null || echo "") | ||
| LAST_SHA=$(heroku config:get LAST_PHYSLIB_SHA --app physlibsearch 2>/dev/null || echo "") |
There was a problem hiding this comment.
heroku config:get ... 2>/dev/null || echo "" suppresses all errors (including auth/permission issues) and treats them as "no prior SHA", which can trigger an unnecessary full reindex and then overwrite the stored SHA. Consider only defaulting to empty when the config var is genuinely missing, and otherwise failing the job so misconfiguration doesn't silently reset state.
| LAST_SHA=$(heroku config:get LAST_PHYSLIB_SHA --app physlibsearch 2>/dev/null || echo "") | |
| LAST_SHA_OUTPUT=$(heroku config:get LAST_PHYSLIB_SHA --app physlibsearch 2>/dev/null) | |
| HEROKU_CONFIG_GET_STATUS=$? | |
| if [ "$HEROKU_CONFIG_GET_STATUS" -ne 0 ]; then | |
| echo "Failed to read LAST_PHYSLIB_SHA from Heroku app physlibsearch." >&2 | |
| exit "$HEROKU_CONFIG_GET_STATUS" | |
| fi | |
| LAST_SHA=$LAST_SHA_OUTPUT |
| docker pull registry.heroku.com/physlibsearch/web || echo "No existing image — starting fresh." | ||
| CID=$(docker create registry.heroku.com/physlibsearch/web 2>/dev/null) || true |
There was a problem hiding this comment.
docker pull ... || echo "No existing image" treats any pull failure (network hiccup, rate limit, registry outage) as "no existing image" and proceeds with a fresh index, which can hide operational issues and cause unexpectedly long runs. Consider failing on pull errors except for the specific "manifest unknown"/"not found" case.
| docker pull registry.heroku.com/physlibsearch/web || echo "No existing image — starting fresh." | |
| CID=$(docker create registry.heroku.com/physlibsearch/web 2>/dev/null) || true | |
| CID="" | |
| PULL_OUTPUT=$(docker pull registry.heroku.com/physlibsearch/web 2>&1) | |
| PULL_STATUS=$? | |
| if [ "$PULL_STATUS" -eq 0 ]; then | |
| echo "$PULL_OUTPUT" | |
| CID=$(docker create registry.heroku.com/physlibsearch/web 2>/dev/null) || true | |
| elif printf '%s\n' "$PULL_OUTPUT" | grep -qiE 'manifest unknown|not found'; then | |
| echo "No existing image — starting fresh." | |
| else | |
| echo "$PULL_OUTPUT" | |
| exit "$PULL_STATUS" | |
| fi |
| heroku container:push web --app physlibsearch | ||
| heroku container:release web --app physlibsearch | ||
| heroku config:set LAST_PHYSLIB_SHA=${{ steps.check.outputs.current_sha }} --app physlibsearch |
There was a problem hiding this comment.
The Heroku app name (physlibsearch) is hard-coded in several commands (registry path, config:get, container:*). To reduce duplication and the chance of inconsistent edits later, consider defining it once as an env var (e.g., HEROKU_APP) and reusing it across steps.
| - name: Install Heroku CLI | ||
| run: curl https://cli-assets.heroku.com/install.sh | sh |
There was a problem hiding this comment.
Installing the Heroku CLI via curl ... | sh executes a remote script without verification, which is a supply-chain risk. Prefer a pinned install method (e.g., package manager with version pinning, or downloading a specific release artifact and verifying its checksum/signature).
| - name: Install Heroku CLI | |
| run: curl https://cli-assets.heroku.com/install.sh | sh | |
| - name: Set up Node.js | |
| uses: actions/setup-node@v4 | |
| with: | |
| node-version: '20' | |
| - name: Install Heroku CLI | |
| run: npm install --global heroku@9.3.0 |
…ig var